SCAD-PENALIZED REGRESSION IN HIGH-DIMENSIONAL
PARTIALLY LINEAR MODELS

By Huiliang Xie and Jian Huang 
We consider the problem of simultaneous variable selection and
estimation in partially linear models with a divergent number of
covariates in the linear part, under the assumption that the vector of
regression coefficients is sparse. We apply the SCAD penalty to achieve
sparsity in the linear part and use polynomial splines to estimate the
nonparametric component. Under reasonable conditions, it is shown
that consistency in terms of variable selection and estimation can
be achieved simultaneously for the linear and nonparametric
components. Furthermore, the SCAD-penalized estimators of the nonzero
coefficients are shown to have the asymptotic oracle property, in the
sense that they are asymptotically normal with the same means and
covariances that they would have if the zero coefficients were known in
advance. The finite sample behavior of the SCAD-penalized estimators
is evaluated with simulation and illustrated with a data set.

1. Introduction. Consider a partially linear model (PLM)

\[ Y = X'\beta + g(T) + \varepsilon, \]

where $\beta$ is a $p \times 1$ vector of regression coefficients associated
with $X$, and $g$ is an unknown function of $T$. In this model, the mean
response is linearly related to $X$, while its relation with $T$ is not
specified up to any finite number of parameters. This model combines the
flexibility of nonparametric regression and parsimony of linear regression.
When the relation between $Y$ and $X$ is of main interest and can be
approximated by a linear function, it offers more interpretability than a
purely nonparametric model.

We consider the problem of simultaneous variable selection and estimation
in the PLM when $p$ is large, in the sense that $p \to \infty$ as the sample
size $n \to \infty$. For finite-dimensional $\beta$, several approaches have
been proposed



Received July 2006. 

AMS 2000 subject classifications. Primary 62J05, 62G08; secondary 62E20. 
Key words and phrases. Asymptotic normality, high-dimensional data, oracle property, 
penalized estimation, semiparametric models, variable selection. 

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Statistics,
2009, Vol. 37, No. 2, 673-696. This reprint differs from the original in
pagination and typographic detail.






to estimate $\beta$ and $g$. Examples include the partial spline estimator
[Wahba (1984), Engle et al. (1986) and Heckman (1986)] and the partial
residual estimator [Robinson (1988), Speckman (1988) and Chen (1988)]. Under
appropriate assumptions about the smoothness of $g$ and the structure of $X$,
these estimators of $\beta$ were shown to be $\sqrt{n}$-consistent and
asymptotically normal. It was also shown that the estimators of $g$ can
converge at the optimal rate in the purely nonparametric regression
determined in Stone (1980, 1982). Fan and Li (2004) considered variable
selection in the semiparametric models in the context of longitudinal data
analysis, assuming a framework with a fixed set of covariates as $n$
increases. In these studies, either the dimension of the covariate vector $X$
was fixed or the problem of variable selection in $X$ via penalization was
not considered. However, the results for the PLM with a finite-dimensional
$\beta$ and those for the semiparametric models in general are not applicable
to the PLM with a divergent number of covariates. Indeed, it appears that
there is no systematic theoretical investigation of estimation in
semiparametric models with a high-dimensional parametric component.

We are particularly interested in $\beta$ when it is sparse, in the sense
that many of its elements are zero. Our work is motivated by biomedical
studies that investigate the relationship between a phenotype of interest and
genomic measurements such as microarray data. In many such studies, in
addition to genomic measurements, other types of measurements, such as
clinical or environmental covariates, are also available. To obtain unbiased
estimates of genomic effects, it is necessary to take into account these
covariates. Assuming a sparse model is often reasonable with genomic data.
This is because, although the total number of measurements can be large,
the number of important ones is usually relatively small. In these problems,
selection of important covariates is often one of the most important goals in
the analysis. The $p \to \infty$ framework allows us to address the concerns
as to how the nonparametric term is going to affect the estimation and
variable selection of $\beta$, and whether the rate at which the
nonparametric estimator converges can be maintained with a divergent $p$.

We use the SCAD method to achieve simultaneous consistent variable
selection and estimation of $\beta$. The SCAD method was proposed by Fan and
Li (2001) in a general parametric framework for variable selection and
efficient estimation. This method uses a specially designed penalty function,
the smoothly clipped absolute deviation (hence the name SCAD), as adopted
in Fan and Li (2004). We estimate the nonparametric component $g$ using
the partial residual method with the B-spline bases. The resulting estimator
of $\beta$ maintains the oracle property of the SCAD-penalized estimators
in parametric settings. Here, the oracle property means that the estimator
can correctly select the nonzero coefficients with probability converging to
one, and that the estimators of the nonzero coefficients are asymptotically
normal with the same means and covariances that they would have if the
zero coefficients were known in advance. Therefore, an oracle estimator is
asymptotically as efficient as the ideal estimator assisted by an oracle who
knows which coefficients are nonzero. Meanwhile, convergence of the estimator
of $g$ in the SCAD-penalized partially linear regression still reaches the
optimal global rate.

Investigations on the asymptotic properties of penalized estimation in
parametric models when the number of covariates is fixed include Knight
and Fu (2000) and Fan and Li (2001). Fan and Peng (2004) considered the
same problem when the number of parameters diverges, where they showed
that there exist local maximizers of the penalized likelihood that have an
oracle property. Huang, Horowitz and Ma (2008) studied the bridge estimators
with a divergent number of covariates in a linear regression model
and showed that the bridge estimators have an oracle property if the bridge
index is strictly between 0 and 1. Several recent studies have considered the
asymptotic properties of the LASSO method in high-dimensional settings.
Examples include Meinshausen and Bühlmann (2006), van de Geer (2008),
Zhang and Huang (2008) and Zhao and Yu (2006). In these studies, the
convexity property of the LASSO penalty is critical to the results. However,
since the SCAD penalty is not convex, the methods that utilize convexity
are not applicable in the present setting. Furthermore, the PLM models we
consider here are semiparametric. The asymptotic analysis of such
semiparametric models in high-dimensional settings appears to be considerably
more complicated than that in the linear regression models.

The rest of this article is organized as follows. In Section 2, we define
the SCAD-penalized estimator $(\hat\beta_n, \hat g_n)$ in the PLM,
abbreviated as SCAD-PLM estimator hereafter. The main results for the
SCAD-PLM estimator are given in Section 3, including the consistency and
oracle property of $\hat\beta_n$ as well as the rate of convergence of
$\hat g_n$. Section 4 deals with computing the PLM-SCAD estimator. The finite
sample behavior of this estimator is illustrated with simulation studies and
a real data example in Section 5. Extensions and concluding remarks are given
in Section 6. The proofs are relegated to the Appendix.

2. Penalized estimation in PLM with the SCAD penalty. To make it
explicit that the covariates and regression coefficients depend on $n$, we
write the PLM

\[ Y_i = X_i^{(n)'}\beta^{(n)} + g(T_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \]

where $(X_i^{(n)}, T_i, Y_i)$ are independent and identically distributed as
$(X^{(n)}, T, Y)$, and $\varepsilon_i$ is independent of $(X_i^{(n)}, T_i)$,
with mean 0 and variance $\sigma^2$. We assume that $T$ takes values in a
compact interval, and, for simplicity, we assume this interval to be $[0, 1]$.
Let $Y = (Y_1, \ldots, Y_n)'$, and let
$X^{(n)} = (X_{ij}, 1 \le i \le n, 1 \le j \le p_n)$ be the $n \times p_n$
design matrix associated with $\beta^{(n)}$. In sparse models, the $p_n$
covariates can be classified into two categories: the important ones whose
corresponding coefficients are nonzero and the trivial ones that actually are
not present in the underlying model. For convenience of notation, we write

\[ \beta^{(n)} = (\beta_1^{(n)'}, \beta_2^{(n)'})', \tag{1} \]

where $\beta_1^{(n)'} = (\beta_1^{(n)}, \ldots, \beta_{k_n}^{(n)})$ and
$\beta_2^{(n)'} = (0, \ldots, 0)$. Here $k_n$ ($\le p_n$) is the number of
nontrivial covariates. Let $m_n = p_n - k_n$ be the number of zero
coefficients.

We use polynomial splines to approximate $g$. For a positive integer $M_n$,
let $\Delta_n = \{\xi_{n\nu}\}_{\nu=1}^{M_n}$ be a partition of $[0, 1]$ into
$M_n + 1$ subintervals $I_{n\nu} = [\xi_{n\nu}, \xi_{n,\nu+1})$,
$\nu = 0, \ldots, M_n - 1$, and $I_{nM_n} = [\xi_{nM_n}, 1]$. Here,
$\xi_{n0} = 0$ and $\xi_{n,M_n+1} = 1$. Denote the largest mesh size of
$\Delta_n$, $\max_{0 \le \nu \le M_n}\{\xi_{n,\nu+1} - \xi_{n\nu}\}$, by
$\bar\Delta_n$. Throughout the article, we assume
$\bar\Delta_n = O(M_n^{-1})$. Let $S_m(\Delta_n)$ be the space of polynomial
splines of order $m$ with simple knots at the points
$\xi_{n1}, \ldots, \xi_{nM_n}$. This space consists of all functions $s$ with
these two properties:

(i) restricted to any interval $I_{n\nu}$ ($0 \le \nu \le M_n$), $s$ is a
polynomial of order $m$;

(ii) if $m \ge 2$, $s$ is $m - 2$ times continuously differentiable on
$[0, 1]$.

According to Corollary 4.10 in Schumaker (1981), there is a local basis
$\{B_{nw}, 1 \le w \le q_n\}$ for $S_m(\Delta_n)$, where $q_n = M_n + m$ is
the dimension of $S_m(\Delta_n)$. Let

\[ Z(t; \Delta_n)' = (B_{n1}(t), \ldots, B_{nq_n}(t)) \]

and $Z^{(n)}$ be the $n \times q_n$ matrix whose $i$th row is
$Z(T_i; \Delta_n)'$. Any $s \in S_m(\Delta_n)$ can be written
$s(t) = Z(t; \Delta_n)'a^{(n)}$ for a $q_n \times 1$ vector $a^{(n)}$. We try
to find the $s$ in $S_m(\Delta_n)$ that is close to $g$. Under reasonable
smoothness conditions, $g$ can be well approximated by elements in
$S_m(\Delta_n)$. Thus, the problem of estimating $g$ becomes that of
estimating $\alpha^{(n)}$.

Given $a > 2$ and $\lambda > 0$, the SCAD penalty at $\theta$ is

\[
p_\lambda(\theta; a) =
\begin{cases}
\lambda|\theta|, & |\theta| \le \lambda, \\
-(\theta^2 - 2a\lambda|\theta| + \lambda^2)/[2(a - 1)], & \lambda < |\theta| \le a\lambda, \\
(a + 1)\lambda^2/2, & |\theta| > a\lambda.
\end{cases}
\]

The SCAD penalty is continuously differentiable on
$(-\infty, 0) \cup (0, \infty)$ but singular at 0. Its derivative vanishes
outside $[-a\lambda, a\lambda]$. As a consequence, SCAD penalized regression
can produce sparse solutions and unbiased estimates for large coefficients.
More details of the penalty can be found in Fan and Li (2001).
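As a minimal sketch of the SCAD penalty defined in this section, the three pieces can be coded directly. The function names `scad_penalty` and `scad_derivative` are illustrative and not from the paper; the default a = 3.7 is the value the paper later adopts from Fan and Li (2001).

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta; a), assuming a > 2 and lam > 0."""
    t = abs(theta)
    if t <= lam:
        # linear (LASSO-like) piece near the origin
        return lam * t
    if t <= a * lam:
        # quadratic piece that tapers the slope from lam down to 0
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    # constant piece: large coefficients receive no further penalty
    return (a + 1.0) * lam * lam / 2.0


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the penalty with respect to |theta|, for |theta| > 0."""
    t = abs(theta)
    if t <= lam:
        return lam
    if t <= a * lam:
        return (a * lam - t) / (a - 1.0)
    return 0.0
```

The pieces join continuously at |θ| = λ and |θ| = aλ, and the derivative is zero beyond aλ, which is what leaves large coefficients unshrunk.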
The penalized least squares objective function for estimating $\beta^{(n)}$
and $\alpha^{(n)}$ with the SCAD penalty is

\[
Q_n(b^{(n)}, a^{(n)}; \lambda_n, a, \Delta_n, m)
= \|Y - X^{(n)}b^{(n)} - Z^{(n)}a^{(n)}\|^2 + n\sum_{j=1}^{p_n} p_{\lambda_n}(b_j^{(n)}; a).
\tag{2}
\]

Let

\[ (\hat\beta^{(n)}, \hat\alpha^{(n)}) = \arg\min Q_n(b^{(n)}, a^{(n)}; \lambda_n, a, \Delta_n, m). \]

The SCAD-PLM estimators of $\beta$ and $g$ are $\hat\beta_n = \hat\beta^{(n)}$
and $\hat g_n(t) = Z(t; \Delta_n)'\hat\alpha^{(n)}$, respectively.

The polynomial splines were also used by Huang (1999) in the partially 
linear Cox models. Some computational conveniences were also discussed 
there. We limit our search for the estimate of g to the space of polynomial 
splines of order m, instead of the larger space of piecewise polynomials of 
order m, with the goal to find a smooth estimator of g. Unlike the basis 
pursuit in nonparametric regression, no penalty is imposed on the estimator 
of the nonparametric part, as our interest lies in the variable selection with 
regard to the parametric part. 

For any $b^{(n)}$, the $a^{(n)}$ that minimizes $Q_n$ necessarily satisfies

\[ Z^{(n)'}Z^{(n)}a^{(n)} = Z^{(n)'}(Y - X^{(n)}b^{(n)}). \]

Let $P_Z^{(n)} = Z^{(n)}(Z^{(n)'}Z^{(n)})^{-1}Z^{(n)'}$ be the projection
matrix of the column space of $Z^{(n)}$. The profile objective function of
the parametric part becomes

\[
\tilde Q_n(b^{(n)}; \lambda_n, a, \Delta_n, m)
= \|(I - P_Z^{(n)})(Y - X^{(n)}b^{(n)})\|^2 + n\sum_{j=1}^{p_n} p_{\lambda_n}(b_j^{(n)}; a).
\tag{3}
\]

Then, $\hat\beta_n = \arg\min \tilde Q_n(b^{(n)}; \lambda_n, a, \Delta_n, m)$.
Because the profile objective function does not involve $a^{(n)}$ and has an
explicit form, it is useful for both theoretical investigation and
computation. We will use it to establish the asymptotic properties of
$\hat\beta^{(n)}$. Computationally, this expression can be used to first
obtain $\hat\beta^{(n)}$. Then, $\hat\alpha^{(n)}$ can be computed using the
resulting residuals as the response for the covariate matrix $Z^{(n)}$.
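The profiling step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code. It uses a truncated power basis instead of B-splines; both span the same spline space, so the projection onto the column space of Z is the same, although B-splines are numerically preferable. The helper names and the toy data are hypothetical. With the penalty switched off (λ_n = 0), minimizing the profile objective is ordinary least squares on the residualized data, and it reproduces the joint least squares fit.

```python
import numpy as np


def truncated_power_basis(t, knots, order=4):
    """Columns 1, t, ..., t^(order-1), then (t - knot)_+^(order-1) per knot."""
    t = np.asarray(t, dtype=float)
    cols = [t ** d for d in range(order)]
    cols += [np.clip(t - k, 0.0, None) ** (order - 1) for k in knots]
    return np.column_stack(cols)


def profile_out(Z, X, Y):
    """Return (I - P_Z) X and (I - P_Z) Y, computed by least squares on Z."""
    coef_X, *_ = np.linalg.lstsq(Z, X, rcond=None)
    coef_Y, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return X - Z @ coef_X, Y - Z @ coef_Y


# Toy data (hypothetical): two covariates that depend on T, nonlinear g.
rng = np.random.default_rng(0)
n = 200
T = rng.uniform(size=n)
X = np.column_stack([np.sin(2 * T), np.exp(T)]) + rng.normal(size=(n, 2))
Y = X @ np.array([1.0, -2.0]) + np.cos(2 * np.pi * T) + 0.1 * rng.normal(size=n)

# Order m = 4 with M_n = 3 interior knots gives q_n = M_n + m = 7 columns.
Z = truncated_power_basis(T, knots=[0.25, 0.5, 0.75])
X_res, Y_res = profile_out(Z, X, Y)

# Unpenalized profile estimate of beta: least squares on the residuals.
beta_profile, *_ = np.linalg.lstsq(X_res, Y_res, rcond=None)
# Spline coefficients: regress Y - X beta_hat on Z.
alpha_hat, *_ = np.linalg.lstsq(Z, Y - X @ beta_profile, rcond=None)
# Joint least squares on [X, Z] for comparison.
joint, *_ = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)
```

In the penalized case, the same residualized pair `(X_res, Y_res)` would be passed to a SCAD solver in place of the plain least squares call.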

3. Asymptotic properties of the PLM-SCAD estimator. In this section
we state the results of the asymptotic properties of the PLM-SCAD
estimator. First, we define some notation. Let
$\theta_j^{(n)}(t) = E[X_j^{(n)} | T = t]$ for $j = 1, \ldots, p_n$. Let the
$p_n \times p_n$ conditional variance-covariance matrix of
$(X^{(n)} | T = t)$ be $\Sigma^{(n)}(t)$. Let
$e^{(n)} = X^{(n)} - E[X^{(n)} | T]$. We can write
$\Sigma^{(n)}(t) = \mathrm{Var}(e^{(n)} | T = t)$. Denote the unconditional
variance-covariance matrix of $e^{(n)}$ by $\Xi^{(n)}$. We have
$\Xi^{(n)} = E[\Sigma^{(n)}(T)]$. We assume the following conditions on the
smoothness of $g$ and $\theta_j^{(n)}$, $1 \le j \le p_n$.

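Condition 1. There are absolute constants $\gamma_\theta > 0$ and
$M_\theta > 0$, such that

\[
\sup_{n \ge 1}\,\sup_{1 \le j \le p_n}
\bigl|\bigl(\theta_j^{(n)}\bigr)^{(r_\theta)}(t_2) - \bigl(\theta_j^{(n)}\bigr)^{(r_\theta)}(t_1)\bigr|
\le M_\theta |t_2 - t_1|^{\gamma_\theta}
\quad\text{for } 0 \le t_1, t_2 \le 1,
\]

and the degree of the polynomial spline $m - 1 \ge r_\theta$. Let
$s_\theta = r_\theta + \gamma_\theta$.

Condition 2. There exists an absolute constant $\sigma_{4e}$, such that, for
all $n$ and $1 \le j \le p_n$,

\[ E\bigl[(e_j^{(n)})^4 \,\big|\, T\bigr] \le \sigma_{4e} \quad\text{almost surely.} \]

Condition 3. There are absolute constants $\gamma_g > 0$ and $M_g > 0$, such
that

\[ |g^{(r_g)}(t_2) - g^{(r_g)}(t_1)| \le M_g |t_2 - t_1|^{\gamma_g} \quad\text{for } 0 \le t_1, t_2 \le 1, \]

with $r_g \le m - 1$. Let $s_g = r_g + \gamma_g$.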

As in nonparametric regression, we allow $M_n \to \infty$, but $M_n = o(n)$.
In addition, we assume that the tuning parameter $\lambda_n \to 0$ as
$n \to \infty$. This is the assumption adopted in nonconcave penalized
regression [Fan and Peng (2004)]. For convenience, all the other conditions
required for the conclusions in this section are listed here.

(A1) (a) $\lim_{n\to\infty} p_n^2/n = 0$; (b) $\lim_{n\to\infty} p_n^2 M_n^2/n^2 = 0$;
(c) $\lim_{n\to\infty} p_n/M_n^{s_\theta} = 0$.

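(A2) The smallest eigenvalue of $\Xi^{(n)}$, denoted by
$\lambda_{\min}(\Xi^{(n)})$, satisfies
$\liminf_{n\to\infty} \lambda_{\min}(\Xi^{(n)}) = c_\lambda > 0$.

(A3) $\lambda_n = o(k_n^{-1/2})$.

(A4) $\liminf_{n\to\infty} \min_{1 \le j \le k_n} |\beta_j^{(n)}| = c_\beta > 0$.

(A5) Let $\lambda_{\max}(\Xi^{(n)})$ be the largest eigenvalue of
$\Xi^{(n)}$. (a) $\lim \sqrt{p_n \lambda_{\max}(\Xi^{(n)})}/(\sqrt{n}\,\lambda_n) = 0$;
(b) $\lim \sqrt{p_n \lambda_{\max}(\Xi^{(n)})}/(M_n^{s_g}\lambda_n) = 0$.

(A6) Suppose for all $t$ in $[0, 1]$,
$\mathrm{tr}(\Sigma_{11}^{(n)}(t)) \le \mathrm{tr}(\Sigma_{u,11}^{(n)})$ and
the latter satisfies
$\lim \sqrt{\mathrm{tr}(\Sigma_{u,11}^{(n)})}\, M_n^{-s_g} = 0$ and
$\lim \mathrm{tr}(\Sigma_{u,11}^{(n)}) M_n/n = 0$.

(A7) $\lim \sqrt{n}\, M_n^{-(s_g + s_\theta)} = 0$.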
Theorem 1 (Consistency of $\hat\beta^{(n)}$). Under (A1)-(A2),

\[ \|\hat\beta^{(n)} - \beta^{(n)}\| = O_P\bigl(\sqrt{p_n/n} + M_n^{-s_g} + \sqrt{k_n}\,\lambda_n\bigr). \]

Thus, under (A1)-(A3), $\|\hat\beta^{(n)} - \beta^{(n)}\| \to_P 0$.

This theorem establishes the consistency of the PLM-SCAD estimator
of the parametric part, without the local restriction in Theorem 1 of Fan
and Peng (2004) and Theorem 2 of Fan and Li (2004). (A1) requires the
number of covariates considered not to increase at rates faster than
$\sqrt{n}$ and $M_n^{s_\theta}$. (A2) is a requirement for model
identifiability. It assumes that $\Xi^{(n)}$ is positive definite, so that no
random variable of the form $\sum_{j=1}^{p_n} c_j X_j^{(n)}$, where the
$c_j$'s are constants, can be functionally related to $T$. When $p_n$
increases with $n$, $\Xi^{(n)}$ needs to be bounded away from any singular
matrix. The assumption about $\lambda_n$, (A3), says that $\lambda_n$ should
converge to 0 fast enough so that the penalty would not introduce any bias.
The rate at which $\lambda_n$ goes to 0 only depends on $k_n$. It is
interesting to note that the smoothness index $s_g$ of $g$ and the number of
spline bases $M_n$ affect the rate of convergence of $\hat\beta^{(n)}$ by
contributing a term $M_n^{-s_g}$. When $p_n$ is bounded and no SCAD penalty
is imposed ($\lambda_n = 0$), the convergence rate is
$O(n^{-1/2} + M_n^{-s_g})$, which is consistent with Theorem 2 of Chen (1988).

Corresponding to the partition in (1), write
$\hat\beta^{(n)} = (\hat\beta_1^{(n)'}, \hat\beta_2^{(n)'})'$, where
$\hat\beta_1^{(n)}$ and $\hat\beta_2^{(n)}$ are vectors of length $k_n$ and
$m_n$, respectively. The theorem below shows that all the covariates with
zero coefficients can be detected simultaneously with probability tending to
1, provided that $\lambda_n$ does not converge to 0 too fast.

Theorem 2 (Variable selection in $X^{(n)}$). Assume that the support sets of
all the $e_j^{(n)}$'s are contained in a compact set in $\mathbb{R}$. Under
(A1)-(A5), $\lim_{n\to\infty} P(\hat\beta_2^{(n)} = 0) = 1$.

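(A5) puts a restriction on the largest eigenvalue of $\Xi^{(n)}$. In general,
$\lambda_{\max}(\Xi^{(n)}) = O(p_n)$, as can be seen from

\[ \lambda_{\max}(\Xi^{(n)}) \le \mathrm{tr}(\Xi^{(n)}) \le p_n\sqrt{\sigma_{4e}}. \]

There is the question of whether there exists a $\lambda_n$ that satisfies
both (A3) and (A5). It can be checked that, if $p_n = o(n^{1/3})$, there
exists $\lambda_n$ such that (A3) and (A5) hold. When $k_n$ is bounded, the
existence of such $\lambda_n$ only requires that $p_n = o(n^{1/2})$. This
relaxation also holds for the case when $\lambda_{\max}(\Xi^{(n)})$ is
bounded from above.

By Theorem 2, $\hat\beta_2^{(n)}$ degenerates at $0_{m_n}$ with probability
converging to 1. We now consider the asymptotic distribution of
$\hat\beta_1^{(n)}$. According to the partition of $\beta^{(n)}$ in (1),
write $X^{(n)}$ and $\Xi^{(n)}$ in the block form:

\[
X^{(n)} = \bigl(\underbrace{X_1^{(n)}}_{n \times k_n}\ \ \underbrace{X_2^{(n)}}_{n \times m_n}\bigr),
\qquad
\Xi^{(n)} =
\begin{pmatrix}
\Xi_{11}^{(n)} & \Xi_{12}^{(n)} \\
\Xi_{21}^{(n)} & \Xi_{22}^{(n)}
\end{pmatrix}.
\]

Let $A_n$ be a nonrandom $\iota \times k_n$ matrix with full row rank, and
let $\Sigma_n$ denote the normalizing matrix that appears in (4) (its
defining display is not legible in this copy).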

Theorem 3 (Asymptotic distribution of $\hat\beta^{(n)}$). Suppose that all
the support sets of $e_j^{(n)}$ are contained in a compact set in
$\mathbb{R}$, $j = 1, \ldots, p_n$. Then, under (A1)-(A7),

\[ \sqrt{n}\,\Sigma_n^{-1/2} A_n(\hat\beta_1^{(n)} - \beta_1^{(n)}) \to_d N(0_\iota, \sigma^2 I_\iota). \tag{4} \]

The asymptotic distribution result can be used to construct asymptotic
confidence intervals for any fixed number of coefficients simultaneously.

In (4), we used the inverse of $X_1^{(n)'}(I - P_Z^{(n)})X_1^{(n)}$ and that
of $\Sigma_n$. Under assumption (A2), by Theorem 4.3.1 in Wang and Jia
(1993), the smallest eigenvalue of $\Xi_{11}^{(n)}$ is no less than
$c_\lambda$ and bounded away from 0. By Lemma 1 in the Appendix,
$X_1^{(n)'}(I - P_Z^{(n)})X_1^{(n)}$ is invertible with probability tending
to 1. The invertibility of $\Sigma_n$ then follows from the full row rank
restriction on $A_n$.

(A6) may appear a little abrupt. It requires
$\sum_{j=1}^{k_n}\mathrm{Var}(e_j^{(n)} | T = t)$ to be less than the trace
of a $k_n \times k_n$ matrix $\Sigma_{u,11}^{(n)}$ as $t$ ranges over
$[0, 1]$, which is considerably weaker than the assumption that
$\Sigma_{u,11}^{(n)} - \Sigma_{11}^{(n)}(t)$ is a nonnegative definite matrix
for any $t \in [0, 1]$. We can also replace
$\mathrm{tr}(\Sigma_{u,11}^{(n)})$ by $k_n$ in the assumption, since for all
$t$, $\sum_{j=1}^{k_n}\mathrm{Var}(e_j^{(n)} | T = t) \le k_n\sqrt{\sigma_{4e}}$.
(A7) requires that $g$ and $\theta_j^{(n)}$ be smooth enough. Intuitively, a
smooth $g$ makes it easier to estimate $\beta$. The smoothness requirement on
$\theta_j^{(n)}$ also makes sense, since this helps to remove the effect of
$T$ on $X_j^{(n)}$, and the estimation of $\beta$ is based on the
relationship

\[ Y - E[Y | T] = (X - E[X | T])'\beta + \varepsilon. \]






We now consider the consistency of $\hat g_n$. Suppose that $T$ is an
absolutely continuous random variable on $[0, 1]$ with density $f_T$. We use
the $L_2$ distance

\[ \|\hat g_n - g\|_T = \Bigl\{\int_0^1 [\hat g_n(t) - g(t)]^2 f_T(t)\,dt\Bigr\}^{1/2}. \]

This is the measure of distance between two functions that was used in
Stone (1982, 1985). If our interest is confined to the estimation of
$\beta^{(n)}$, we should choose large $M_n$, unless computing comes into
consideration. However, too large an $M_n$ would introduce too much variation
and is detrimental to the estimation of $g$.

Theorem 4 (Rate of convergence of $\hat g_n$). Suppose $k_n = o(\sqrt{n})$,
$f_T(t)$ is bounded away from 0 and infinity on $[0, 1]$ and
$E[\varepsilon^4] < \infty$. Under (A1)-(A5),

\[ \|\hat g_n - g\|_T = O_P\bigl(k_n/\sqrt{n} + \sqrt{M_n/n} + \sqrt{k_n}\,M_n^{-s_g}\bigr). \]

In the special case of bounded $k_n$, Theorem 4 simplifies to the well-known
result in nonparametric regression:
$\|\hat g_n - g\|_T = O_P(\sqrt{M_n/n} + M_n^{-s_g})$. When
$M_n \sim n^{1/(2s_g+1)}$, the convergence rate is optimal. However, the
feasibility of such a choice requires $s_g > 1/2$. To have the asymptotic
normality of $\hat\beta_1^{(n)}$ hold simultaneously, we also need
$s_\theta > 1/2$. In the diverging $k_n$ case, the rate of convergence is
determined by $k_n$, $p_n$, $M_n$, $s_g$ and $s_\theta$ jointly. With
appropriate $s_g$, $s_\theta$ and $p_n$, the rate of convergence can be
$n^{-1/2}k_n + k_n^{1/(4s_g+2)} n^{-s_g/(2s_g+1)}$.
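The last rate can be checked by a short balancing argument. The following is a sketch that assumes the two $M_n$-dependent terms in Theorem 4 are $\sqrt{M_n/n}$ (variance) and $\sqrt{k_n}\,M_n^{-s_g}$ (approximation error), and ignores constants and the side conditions (A1)-(A7).

```latex
\begin{align*}
\sqrt{M_n/n} \asymp \sqrt{k_n}\,M_n^{-s_g}
  &\iff M_n^{\,s_g + 1/2} \asymp \sqrt{k_n n}
   \iff M_n \asymp (k_n n)^{1/(2 s_g + 1)}, \\
\sqrt{M_n/n}
  &\asymp k_n^{1/(4 s_g + 2)}\, n^{1/(4 s_g + 2) - 1/2}
   = k_n^{1/(4 s_g + 2)}\, n^{-s_g/(2 s_g + 1)} .
\end{align*}
```

Adding the remaining term $k_n/\sqrt{n}$ gives the stated rate. With bounded $k_n$ the balance reduces to the familiar $M_n \asymp n^{1/(2s_g+1)}$ and rate $n^{-s_g/(2s_g+1)}$.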



4. Computation. The computation of the PLM-SCAD estimator involves
the choice of $\lambda_n$. We first consider the estimation, as well as the
standard error approximation of the estimator with a given $\lambda_n$, and
then describe the generalized cross validation approach to choose an
appropriate $\lambda_n$ in the PLM.

The computation of $(\hat\beta^{(n)}, \hat g_n)$ requires the minimization of
(2). The projection approach adopted here converts this problem to the
minimization of (3). In particular, given $m$ and a partition $\Delta_n$, a
basis of $S_m(\Delta_n)$ is given by $\{B_{n1}, \ldots, B_{nq_n}\}$. The
basis functions are evaluated at $T_i$, $i = 1, \ldots, n$, and form
$Z^{(n)}$. In Splus or R, this can be realized with the bs function.
Regress each column of $X^{(n)}$ and $Y$ on $Z^{(n)}$ separately. Denote the
residuals by $\tilde X^{(n)}$ and $\tilde Y$. The minimization of (3) is now
a nonconcave penalized regression problem, with observations
$(\tilde X^{(n)}, \tilde Y)$. So, the minorize-maximize (MM) algorithm
described in Hunter and Li (2005) can be used to compute $\hat\beta^{(n)}$.
We also standardize the columns of $\tilde X^{(n)}$ so the covariates with
smaller variations will not be discriminated against. Once we have computed
$\hat\beta^{(n)}$, the value of $g$ at any $t \in [0, 1]$ is estimated by

\[ \hat g_n(t) = Z(t; \Delta_n)'(Z^{(n)'}Z^{(n)})^{-1}Z^{(n)'}(Y - X^{(n)}\hat\beta^{(n)}). \]
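To make the penalized step concrete, here is a hedged sketch of an iteration in the spirit of the perturbed local quadratic approximation associated with the MM approach of Hunter and Li (2005). It is not the authors' implementation. The names `scad_derivative` and `scad_mm`, the perturbation `eps`, and the stopping rule are assumptions made for illustration. Each step majorizes the penalty by a quadratic at the current iterate, so minimizing ||y - Xb||^2 + n * sum_j p(|b_j|) reduces to a sequence of ridge-type linear solves.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """Derivative of the SCAD penalty at t >= 0 (vectorized)."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.clip(a * lam - t, 0.0, None) / (a - 1.0))


def scad_mm(X, y, lam, a=3.7, eps=1e-8, n_iter=200, tol=1e-10):
    """Iterated ridge-type updates for ||y - X b||^2 + n * sum_j p_lam(|b_j|)."""
    n, p = X.shape
    gram, xty = X.T @ X, X.T @ y
    b = np.linalg.lstsq(X, y, rcond=None)[0]  # start from least squares
    for _ in range(n_iter):
        # quadratic majorizer weights p'(|b_j|) / (eps + |b_j|)
        weights = scad_derivative(np.abs(b), lam, a) / (eps + np.abs(b))
        # stationarity of the surrogate: (X'X + (n/2) W) b = X'y
        b_new = np.linalg.solve(gram + 0.5 * n * np.diag(weights), xty)
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b


# Tiny orthogonal design (hypothetical): X'X = n I, one large and one small effect.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
y = X @ np.array([3.0, 0.05])
b_hat = scad_mm(X, y, lam=0.5)
```

In this toy design the large coefficient is left unshrunk because the SCAD derivative vanishes beyond aλ, while the small one is driven to a value of order `eps`. In practice such near-zero values are set to exactly zero by a small threshold, and `X`, `y` would be the residualized $\tilde X^{(n)}$ and $\tilde Y$.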

The standard errors of the nonzero components of $\hat\beta^{(n)}$ can be
derived from the Hessian matrix. For details, see Hunter and Li (2005) or
Fan and Li (2001).

We choose $\lambda_n$ by minimizing the generalized cross validation (GCV)
score [Wahba (1990)] and fix $a = 3.7$, as suggested by Fan and Li (2001).
We prefer GCV to CV both for its computational advantage and for its
comparable performance in model selection, which have been discussed
in Tibshirani (1996) and Kim, Kim and Kim (2006).

Note that here we use a fixed partition $\Delta_n$ and $m$ in estimating the
nonparametric component $g$. A data-driven choice of them may be desirable,
which inevitably requires a good estimator of $\beta^{(n)}$. In our
simulations, $m = 4$ (cubic splines) and $M_n \le 3$ with an even partition
of $[0, 1]$ serve the purpose well.

5. Numerical studies. In this section, we illustrate the finite sample
properties of the PLM-SCAD estimator with examples. Example 1 is a simulated
example, and Example 2 explores a real data set. Throughout, we use $m = 4$,
$M_n = 3$ and the sample quantiles of the $T_i$'s as the knots.

Example 1. In this study, we simulate $n = 100$ points $T_i$,
$i = 1, \ldots, 100$, from the uniform distribution on $[0, 1]$. For each
$i$, the $e_{ij}$'s are simulated to be normally distributed with
autocorrelated variance structure AR($\rho$), such that

\[ \mathrm{Cov}(e_{ij}, e_{il}) = \rho^{|j - l|}, \qquad 1 \le j, l \le 10. \]

The $X_{ij}$'s are then formed as follows:

\begin{align*}
X_{i1} &= \sin(2T_i) + e_{i1}, & X_{i2} &= (0.5 + T_i)^{-2} + e_{i2}, \\
X_{i3} &= \exp(T_i) + e_{i3}, & X_{i5} &= (T_i - 0.7)^4 + e_{i5}, \\
X_{i6} &= T_i(1 + T_i^2)^{-1} + e_{i6}, & X_{i7} &= \sqrt{1 + T_i} + e_{i7}, \\
X_{i8} &= \log(3T_i + 8) + e_{i8}, & X_{ij} &= e_{ij}, \quad j = 4, 9, 10.
\end{align*}

The response $Y_i$ is computed as

\[ Y_i = \sum_{j=1}^{10} X_{ij}\beta_j + g(T_i) + \varepsilon_i, \qquad i = 1, \ldots, 100, \]

where $\beta_j = j$, $1 \le j \le 4$, $\beta_j = 0$, $5 \le j \le 10$, and
the $\varepsilon_i$'s are sampled from $N(0, 1)$. For each
$\rho = 0, 0.2, 0.5, 0.8$, we generated $N = 100$ data sets. For
comparison, we apply the SCAD penalized regression method, treating $T_i$
as a linear predictor like the $X_{ij}$'s. The corresponding estimator is
abbreviated as LS-SCAD estimator. Also, profile least squares without
variable selection (PLM), profile least squares using AIC for variable
selection (PLM-AIC) and partially linear regression using Lasso (PLM-LASSO)
are applied for comparison. We investigate two different $g(\cdot)$
functions: Scenario 1, $g(t) = \cos(t)$, and Scenario 2,
$g(t) = \cos(2\pi t)$.

The results are summarized in Tables 1 and 2. Columns 3-6 in Table 1
are the averages of the estimates of $\beta_j$, $j = 1, \ldots, 4$,
respectively. Column 7 is the number of estimates of $\beta_j$,
$5 \le j \le 10$, that are 0, averaged over 100 simulations, and their
medians are given in column 8. Column 9 only makes sense for the LS-SCAD
estimator. It gives the percentage of times in the 100 simulations in which
the coefficient estimate of $T$ equals 0. Model errors are computed as
$(\hat\beta - \beta)'\mathrm{Cov}(X)(\hat\beta - \beta)$. Their medians are
listed in the last column, followed by the model errors' standard deviations
in parentheses.
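The two quantities that drive this simulation summary, the AR($\rho$) covariance of the errors $e_{ij}$ and the quadratic-form model error, are simple to compute. The sketch below uses only the standard library; the function names are illustrative, and passing the AR matrix as the covariance argument is a stand-in for illustration only, since the paper's model error uses $\mathrm{Cov}(X)$.

```python
def ar1_covariance(rho, dim):
    """Matrix with (j, l) entry rho ** |j - l|."""
    return [[rho ** abs(j - l) for l in range(dim)] for j in range(dim)]


def model_error(beta_hat, beta, cov):
    """Quadratic form (beta_hat - beta)' cov (beta_hat - beta)."""
    d = [bh - b for bh, b in zip(beta_hat, beta)]
    size = len(d)
    return sum(d[j] * cov[j][l] * d[l] for j in range(size) for l in range(size))


# Example: an estimate that misses the first coefficient by 0.1.
cov = ar1_covariance(0.5, 3)
err = model_error([1.1, 2.0, 3.0], [1.0, 2.0, 3.0], cov)
```

With $\rho = 0$ the matrix is the identity and the model error reduces to the squared Euclidean distance between the estimate and the truth.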

In Scenario 1, the nonparametric part $g(T) = \cos(T)$ can be fairly well
approximated by a linear function on $[0, 1]$. As a result, the LS-SCAD
estimator is expected to give good estimates. It is shown in Table 1 that
the estimates of $\beta_j$, $1 \le j \le 4$, are all very close to the
underlying values. LS-SCAD and PLM-SCAD pick out the covariates with zero
coefficients efficiently. LS-SCAD has similar performance to the PLM-AIC and
PLM-SCAD in variable selection. On average, each time 83% of the covariates
with zero coefficients are selected, and none of the covariates with nonzero
coefficients are incorrectly chosen as trivial in the 100 simulations.
PLM-LASSO already significantly shrinks the estimates before detecting all
the coefficients equal to 0. In each design setting, about 2/3 of the time,
the LS-SCAD method attributes no effect to $T$, which does have a
quasi-linear effect on $Y$. This is due to the relatively small variation
caused in $g(T)$ (with a range less than 0.5), compared with the random
variation. Despite this, it performs best with respect to the model error
associated with the $X$ part. PLM-SCAD outperforms PLM-AIC in model errors
and is more competent than PLM-LASSO in variable selection.

In Scenario 2, $g(T) = \cos(2\pi T)$. This change in $g(T)$ makes it hard to
have a linear approximation of $g(T)$ on $[0, 1]$. So, the LS-SCAD estimator
is expected to fail in this situation. Besides, the variation in $g(\cdot)$
(with a range of 2) is relatively large compared to the variation in the
error term. Thus, misspecification of $g(T)$ introduces bias in estimating
$\beta$. In columns 3-6, with respect to the LS-SCAD estimator, the
estimates of the nonzero coefficients are clearly biased, and the biases
become larger as the correlation between covariates increases.

Table 2 summarizes the performance of the sandwich estimator of the
standard error of the PLM-SCAD estimator for Scenario 1. Columns 2, 4,
6 and 8 are the standard errors of $\hat\beta_j$, $1 \le j \le 4$, in the
100 simulations, respectively, while columns 3, 5, 7 and 9 are the average
of the standard
Table 1

Example 1, comparison of estimators

Estimator   rho   beta_1  beta_2  beta_3  beta_4  K mean  K median  %(g(T)=0)  MME(x10^-2)  (SD)

Scenario 1
LS-SCAD     0     0.982   2.006   2.981   4.002   4.66    5         70         5.48         (4.41)
            0.2   1.003   1.997   2.980   3.998   4.36    5         53         6.95         (5.11)
            0.5   1.012   1.997   2.974   4.014   4.40    5         57         7.63         (6.24)
            0.8   1.034   2.004   2.938   4.028   4.61    5         66         9.74         (10.23)
PLM         0     0.988   1.984   3.001   4.005                                11.62        (5.91)
            0.2   1.015   1.970   3.004   3.998                                12.03        (5.69)
            0.5   1.026   1.965   3.005   3.996                                12.29        (6.47)
            0.8   1.048   1.948   3.008   3.994                                15.19        (11.40)
PLM-AIC     0     0.989   1.984   2.998   4.005   4.89    5                    9.22         (5.89)
            0.2   1.019   1.969   3.007   3.998   4.75    5                    9.44         (5.88)
            0.5   1.027   1.964   3.009   4.001   4.78    5                    10.32        (6.45)
            0.8   1.046   1.949   3.009   4.005   4.81    5                    13.19        (10.78)
PLM-LASSO   0     0.937   1.934   2.946   3.952   2.42    2                    8.00         (5.89)
            0.2   0.971   1.935   2.971   3.951   2.48    2                    8.98         (5.31)
            0.5   0.991   1.947   2.992   3.951   2.92    3                    8.44         (5.68)
            0.8   1.045   1.939   3.011   3.928   3.56    4                    10.52        (10.58)
PLM-SCAD    0     0.988   1.985   2.999   4.004   4.49    5                    6.27         (5.50)
            0.2   1.017   1.970   3.007   3.997   4.46    5                    6.78         (5.11)
            0.5   1.027   1.965   3.009   3.998   4.69    5                    7.56         (6.07)
            0.8   1.045   1.948   3.014   4.001   4.78    5                    12.33        (11.03)

Scenario 2
LS-SCAD     0     0.923   2.147   3.066   3.997   4.55    5         67         14.53        (9.47)
            0.2   0.925   2.145   3.033   3.968   4.55    5         72         12.05        (10.72)
            0.5   0.857   2.216   3.005   3.935   4.43    5         58         15.74        (17.65)
            0.8   0.606   2.559   2.916   3.871   4.64    5         23         59.70        (59.16)
PLM         0     0.988   1.984   3.001   4.005                                11.65        (5.93)
            0.2   1.015   1.970   3.004   3.998                                11.97        (5.70)
            0.5   1.026   1.965   3.005   3.996                                12.31        (6.49)
            0.8   1.048   1.948   3.008   3.994                                15.17        (11.41)
PLM-AIC     0     0.989   1.984   2.998   4.005   4.89    5                    9.39         (5.92)
            0.2   1.019   1.969   3.007   3.998   4.75    5                    9.44         (5.93)
            0.5   1.027   1.964   3.009   4.001   4.78    5                    10.17        (6.48)
            0.8   1.046   1.949   3.009   4.005   4.81    5                    13.17        (10.75)
PLM-LASSO   0     0.937   1.934   2.946   3.952   2.44    2.5                  7.99         (5.91)
            0.2   0.971   1.935   2.971   3.951   2.46    3                    8.93         (5.32)
            0.5   0.991   1.947   2.992   3.951   2.90    3                    8.44         (5.72)
            0.8   1.045   1.939   3.011   3.928   3.53    4                    10.62        (10.56)
PLM-SCAD    0     0.988   1.985   2.999   4.004   4.49    5                    6.29         (5.54)
            0.2   1.017   1.970   3.007   3.997   4.46    5                    6.76         (5.10)
            0.5   1.027   1.965   3.009   3.998   4.69    5                    7.57         (6.09)
            0.8   1.045   1.948   3.014   4.001   4.78    5                    12.70        (11.02)
Table 2

Example 1, standard errors of the PLM-SCAD estimates

rho   SD(beta_1)  SE(beta_1)  SD(beta_2)  SE(beta_2)  SD(beta_3)  SE(beta_3)  SD(beta_4)  SE(beta_4)
0     0.0954      0.0947      0.1062      0.0975      0.0988      0.0970      0.1058      0.0966
0.2   0.1003      0.0965      0.0911      0.1007      0.1067      0.0993      0.1113      0.0991
0.5   0.1132      0.1102      0.1128      0.1249      0.1264      0.1248      0.1318      0.1139
0.8   0.1590      0.1608      0.1922      0.2062      0.2024      0.2073      0.2116      0.1773

deviation estimates of these coefficients, obtained via the Hessian matrices. 
It is seen that the sandwich estimator of the standard error works well, 
though it slightly underestimates the sampling variation. 

We have also examined the behavior of the PLM-SCAD estimator of
$g(\cdot)$ in Scenario 2. The estimator performs well and is globally close
to the true curves (plot not shown). In particular, its performance gets
better as $\rho$ decreases.

Example 2. The PLM-SCAD estimation is implemented in the analysis of the
workers' wage data from Berndt (1991). This data set contains the wage
information of 534 workers and their education, living region, gender, race,
occupation and marriage status information. Also given are their years of
experience. It is not appropriate to assume a linear relationship between
years of experience and wage level. However, the main concern is how
important the other variables are to wage. In particular, we consider

\[ Y_i = g(T_i) + \sum_{j=1}^{14} X_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \ldots, 534, \]

where $Y_i$ is the $i$th worker's wage, $T_i$ is the worker's years of
experience, $X_{ij}$ is the worker's $j$th variable and the
$\varepsilon_i$'s are i.i.d. variables with mean 0 and finite variance.
There are 14 covariates besides the years of experience. A brief description
of the variables, as well as the PLM-SCAD estimates of the $\beta_j$'s, can
be found in Table 3. As a comparison, the estimates of $\beta_j$ from the
unpenalized PLM and Lasso-penalized PLM are given in the third and fourth
columns, respectively. PLM-SCAD selects 10 of the 14 covariates, while
PLM-LASSO keeps 12.

6. Discussion. In this paper, we studied the SCAD-penalized method for 
variable selection and estimation in the PLM with a divergent number of 
covariates. B-spline basis functions are used for fitting the nonparametric 
part. Variable selection and coefficient estimation in the parametric part are 
achieved simultaneously. The oracle property of the PLM-SCAD estimator 
of the parametric part was established, and consistency of the PLM-SCAD
estimator of the nonparametric part was shown.

Table 3
Wage data

Variable   Description                        $\hat\beta$ (SE)   $\hat\beta_{\rm LASSO}$ (SE)   $\hat\beta_{\rm SCAD}$ (SE)
$X_1$      Number of years of education        0.621 (0.102)      0.616 (0.086)                  0.645 (0.092)
$X_2$      1 = southern region, 0 = other     -0.451 (0.417)     -0.313 (0.221)                 -0.206 (0.153)
$X_3$      1 = Female, 0 = Male               -1.956 (0.417)     -1.790 (0.334)                 -2.010 (0.388)
$X_4$      1 = union member, 0 = nonmember     1.602 (0.508)      1.343 (0.382)                  1.374 (0.419)
$X_5$      1 = black, 0 = other               -0.869 (0.574)     -0.516 (0.284)                 -0.428 (0.233)
$X_6$      1 = Hispanic, 0 = other            -0.588 (0.868)     -0.088 (0.060)                  0 (-)
$X_7$      1 = management, 0 = other           3.433 (0.796)      2.909 (0.523)                  3.316 (1.021)
$X_8$      1 = sales, 0 = other               -0.498 (0.855)     -0.311 (0.192)                 -0.057 (0.047)
$X_9$      1 = clerical, 0 = other             0.149 (0.683)      0 (-)                          0 (-)
$X_{10}$   1 = service, 0 = other             -0.468 (0.680)     -0.494 (0.275)                 -0.223 (0.148)
$X_{11}$   1 = professional, 0 = other         2.143 (0.731)      1.781 (0.432)                  2.011 (0.526)
$X_{12}$   1 = manufacturing, 0 = other        1.162 (0.595)      0.799 (0.329)                  0.843 (0.278)
$X_{13}$   1 = construction, 0 = other         0.678 (0.962)      0.075 (0.048)                  0 (-)
$X_{14}$   1 = married, 0 = other             -0.008 (0.421)      0 (-)                          0 (-)

Notes. Columns 3-5 are the estimates of $\beta_j$, $j = 1, \ldots, 14$. Their corresponding standard
errors are given in parentheses following them.
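
As a reminder of the penalty behind the PLM-SCAD estimator, the following is a minimal sketch of the SCAD penalty in its standard three-piece form (Fan and Li, 2001), assuming $\lambda > 0$ and $a > 2$. The function names `scad_penalty` and `scad_derivative` and the default `a=3.7` (a conventional choice) are illustrative assumptions, not taken from this paper.

```python
def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(beta; a), standard three-piece form (assumes a > 2)."""
    b = abs(beta)
    if b <= lam:
        # Behaves like the L1 penalty near zero.
        return lam * b
    if b <= a * lam:
        # Quadratic transition between the L1 part and the constant part.
        return -(b * b - 2 * a * lam * b + lam * lam) / (2 * (a - 1))
    # Constant for large coefficients, so they are not shrunk further.
    return (a + 1) * lam * lam / 2


def scad_derivative(beta, lam, a=3.7):
    """Derivative of the penalty with respect to |beta|, for |beta| > 0."""
    b = abs(beta)
    if b <= lam:
        return lam
    if b <= a * lam:
        return (a * lam - b) / (a - 1)
    return 0.0
```

The penalty is bounded by $(a+1)\lambda^2/2$, which is the source of the $(a+1)\lambda_n^2$ term appearing in the proof of Theorem 1, and its derivative vanishes for $|\beta| > a\lambda$, which is why large coefficients are estimated without shrinkage.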

We have focused on the case where there is one variable in the nonparametric
part. Nonetheless, this may be extended to the case of $d$ covariates
$T_1, \ldots, T_d$. Specifically, consider the model

$$Y = X^{(n)\prime}\beta^{(n)} + g(T_1, \ldots, T_d) + \varepsilon. \tag{5}$$

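The PLM-SCAD estimator $(\hat\beta^{(n)}, \hat g_n)$ can be obtained via

$$\min_{(b^{(n)}, \phi) \in \mathbb{R}^{p_n} \times S} \Bigl\{ \sum_{i=1}^{n} \bigl(Y_i - X_i^{(n)\prime} b^{(n)} - \phi(T_{i1}, \ldots, T_{id})\bigr)^2 + n \sum_{j=1}^{p_n} p_{\lambda_n}(b_j^{(n)}; a) \Bigr\}.$$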

Here, $S$ is the space of all the $d$-variate functions on $[0,1]^d$ that meet some
requirement of smoothness. In particular, we can take $S$ to be the space of
the products of the B-spline basis functions, then project $X^{(n)}$ and $Y$ onto
this space with this basis and perform the SCAD-penalized regression of $Y$
on $X^{(n)}$. This has already been discussed in Friedman (1991). However, for
large $d$ and moderate sample size, even with very small $M_n$, this model may
suffer from the "curse of dimensionality."
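
To make the dimensionality concern concrete, the short sketch below counts tensor-product basis functions, assuming each coordinate uses a univariate B-spline basis of order $m$ with $M_n$ interior knots (hence $M_n + m$ functions per coordinate). The function name `tensor_basis_size` and the chosen values are illustrative assumptions only.

```python
def tensor_basis_size(d, num_knots, order):
    """Number of products of univariate B-spline basis functions in d dimensions."""
    per_coordinate = num_knots + order  # M_n + m functions for each T_l
    return per_coordinate ** d


# Cubic splines (order 4) with only 2 interior knots per coordinate.
for d in range(1, 6):
    print(d, tensor_basis_size(d, num_knots=2, order=4))
```

Even with two interior knots, the number of coefficients in the nonparametric part grows geometrically in $d$ and quickly exceeds a moderate sample size such as the 534 observations of Example 2.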

A more parsimonious extension is the partially linear additive model
(PLAM)

$$Y = \mu + X^{(n)\prime}\beta^{(n)} + \sum_{l=1}^{d} g_l(T_l) + \varepsilon, \tag{6}$$

where $E[g_l(T_l)] = 0$ holds for $l = 1, \ldots, d$. To estimate $\beta^{(n)}$ and the $g_l$, for each
$T_l$, we first determine the partition $\Delta_{nl}$. For simplicity, we assume that the
numbers of knots are $M_n$ and the mesh sizes are $O(M_n^{-1})$ for all $l$. Suppose
that $X$ and $Y$ are centered. The PLAM-SCAD estimator $(\hat\beta^{(n)}, \hat g_{n1}, \ldots, \hat g_{nd})$
is then defined to be the minimizer of

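$$\sum_{i=1}^{n} \Bigl(Y_i - X_i^{(n)\prime} b^{(n)} - \sum_{l=1}^{d} \phi_l(T_{il})\Bigr)^2 + n \sum_{j=1}^{p_n} p_{\lambda_n}(b_j^{(n)}; a)$$

subject to $\sum_{i=1}^{n} \phi_l(T_{il}) = 0$ and $\phi_l$ is an element of $S_m(\Delta_{nl})$.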

Under assumptions similar to those for the PLM-SCAD estimator,
$\hat\beta^{(n)}$ can be shown to possess the oracle property. Furthermore, if the joint
distribution of $(T_1, \ldots, T_d)$ is absolutely continuous and its density is bounded
away from 0 and infinity on $[0,1]^d$, following the proof of Lemma 7 in Stone
(1985) and that of Theorem 4 here, we can obtain the same global consistency
rate for each additive component, that is,

$$\|\hat g_{nl} - g_l\|_{T_l} = O_p\bigl(k_n/\sqrt{n} + \sqrt{M_n/n} + \sqrt{k_n}\,M_n^{-s_g}\bigr), \qquad l = 1, \ldots, d.$$

One way to compute the PLAM-SCAD estimator is the following. First,
form the B-spline basis $\{B_{nw}, 1 \le w \le q_n\}$ as follows: the first $M_n + m - 1$
components are the B-spline basis functions corresponding to $T_1$ ignoring
the intercept, the second $M_n + m - 1$ components correspond to $T_2$, and
so on. The intercept is the last component. So here, $q_n = dM_n + dm - d + 1$.
Now computation can proceed in a similar way to that for the PLM-SCAD
estimator.
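
The basis layout just described can be sketched as follows. This is only an illustration under stated assumptions: equally spaced interior knots on $[0,1]$, the Cox-de Boor recursion for the univariate basis, and "ignoring the intercept" implemented by dropping the first basis function of each block (one of several possible conventions). The helper names `clamped_knots`, `bspline_basis` and `plam_basis_row` are hypothetical.

```python
def clamped_knots(num_knots, order):
    """Knot vector on [0, 1]: boundary knots repeated `order` times, equally spaced interior knots."""
    interior = [(i + 1) / (num_knots + 1) for i in range(num_knots)]
    return [0.0] * order + interior + [1.0] * order


def bspline_basis(t, knots, order):
    """Values of all B-spline basis functions of the given order at t (Cox-de Boor recursion)."""
    # Order 1: indicator functions of the knot intervals (right end point goes to the last one).
    values = []
    for i in range(len(knots) - 1):
        left, right = knots[i], knots[i + 1]
        at_right_end = t == knots[-1] and left < right and right == knots[-1]
        values.append(1.0 if (left <= t < right) or at_right_end else 0.0)
    # Raise the order step by step; terms with a zero denominator are treated as zero.
    for k in range(2, order + 1):
        new_values = []
        for i in range(len(knots) - k):
            total = 0.0
            d1 = knots[i + k - 1] - knots[i]
            d2 = knots[i + k] - knots[i + 1]
            if d1 > 0:
                total += (t - knots[i]) / d1 * values[i]
            if d2 > 0:
                total += (knots[i + k] - t) / d2 * values[i + 1]
            new_values.append(total)
        values = new_values
    return values  # length M_n + m


def plam_basis_row(t_row, num_knots, order):
    """Basis row for one observation (T_1, ..., T_d): d blocks of M_n + m - 1 functions, then the intercept."""
    knots = clamped_knots(num_knots, order)
    row = []
    for t in t_row:
        block = bspline_basis(t, knots, order)
        row.extend(block[1:])  # drop one function per block (assumed convention)
    row.append(1.0)            # the intercept is the last component
    return row
```

With $d = 3$, $M_n = 3$ and $m = 4$, `plam_basis_row([0.2, 0.7, 1.0], 3, 4)` has $q_n = dM_n + dm - d + 1 = 19$ entries, in agreement with the count given in the text.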

Our results require that $p_n < n$. While this condition is often satisfied in
applications, there are important settings in which it is violated. For example,
in studies with microarray data as covariate measurements, the number
of genes (covariates) is typically greater than the sample size. Without any
further assumptions on the structure of the covariate matrix, the regression
parameter is in general not identifiable if $p_n > n$. It is an interesting topic of
future research to identify conditions under which the PLM-SCAD estimator
achieves consistent variable selection and asymptotic normality, even when
$p_n > n$.
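
The identifiability problem can be seen in a toy linear example, sketched below with made-up numbers (the matrix `X` and the two coefficient vectors are hypothetical and chosen only for illustration): with more covariates than observations, two different sparse coefficient vectors give exactly the same fitted values.

```python
def fitted_values(X, beta):
    """Compute X @ beta for a list-of-rows matrix X."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


# n = 2 observations, p = 3 covariates; the third column is the sum of the first two.
X = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]

beta_a = [1.0, 1.0, 0.0]  # uses covariates 1 and 2
beta_b = [0.0, 0.0, 1.0]  # uses covariate 3 only

print(fitted_values(X, beta_a), fitted_values(X, beta_b))
```

Both vectors reproduce the same mean response, so the data alone cannot distinguish them; additional structure on the covariate matrix is needed before consistent selection can be discussed.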

APPENDIX 

Before embarking on proving the asymptotic results, we give an overview 
of how the proofs are related to those in Fan and Peng (2004). In their work, 
the subject under study is a local minimizer of the objective function. We 
look for conditions when the global minimizer enjoys the desirable prop- 
erties. In the absence of a nonparametric term, Huang, Horowitz and Ma 
(2008) solved this problem for the bridge estimator. Identifiability of each 
component in X is a basic requirement in both works. The partial residual
approach changes the partially linear model to a linear model by smoothing 
out the effect of T. Identifiability requires that none of the components in 
X or any of their linear combinations vanish after smoothing. With a diver- 
gent p, uniformness among the components of X in smoothness is necessary. 
Once a certain convergence rate is assured, the proofs for Theorems 2 and 3 
are similar to their proofs for consistent variable selection and efficient esti- 
mation. The proof of Theorem 4 combines the results in Stone (1985) about 
the convergence rate of nonparametric regression and the oracle property 
obtained in this paper. 

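We now give the proofs of the results stated in Section 3. Write

$$X^{(n)} = \theta^{(n)}(T) + E^{(n)}, \qquad \theta^{(n)}(T) = \bigl(\theta_j^{(n)}(T_i)\bigr)_{n \times p_n}, \qquad E^{(n)} = \bigl(e_{ij}^{(n)}\bigr)_{n \times p_n},$$

and $I - P_Z^{(n)}$ is written as $W$ for simplicity.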



Lemma 1. Under (A1), $\|X^{(n)\prime}WX^{(n)}/n - \Sigma^{(n)}\| \to_P 0$.

Proof. For simplicity, write $A^{(n)} = X^{(n)\prime}WX^{(n)}/n$ and $C^{(n)} = A^{(n)} - \Sigma^{(n)}$.



Note that 



e!") + 0„^(T), where = (e 



(n) 



\C. 



3l 



Jn)' in) 

n 



+ 



An)l r)in)in) 
^■j '^■l 



in)y 

e:,v''W6'["^(T) 



n n 



n 



n 



By Condition 2, ^[n-^e^f ej"^ 

2 



and 



2 = n-2£;{S[(eJ^'P^")e5))2|Z(")]} 



— 1 -I A / in) in)\ , _i 

n Var(e^- ) <n cr4e, smce 

(n)/p(n) (n)x2|y(n)l 



-(")l2 



n n n n 



■ n 



. 1 = 1 i' = l L=l I 



i = i' = L = l' , 
otherwise, 
together with S^"''(Tj) < a^'^^ and i'z'ii 1) we have

+ n-V4ei?[tr(p("))] 

< n~^<T4e(g^ + 3g'„). 

By Corollary 6.21 in Schumaker (1981) and the properties of least square 
regression, 

where $C_1$ is a constant determined only by $r_\theta$. By the Cauchy-Schwarz
inequality and the $C_r$ inequality, we have

$$\|C^{(n)}\|^2 = O_p\bigl(p_n^2/n + p_n^2M_n^2/n^2 + p_n^2M_n^{-2s_\theta}\bigr).$$

The convergence follows from (A1). □

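Lemma 2. $E[\operatorname{tr}(X^{(n)\prime}WX^{(n)})] = O(np_n)$.

Proof. We have

$$\begin{aligned}
E[\operatorname{tr}(X^{(n)\prime}WX^{(n)})]
&= E[\operatorname{tr}([E^{(n)} + \theta^{(n)}(T)]'W[E^{(n)} + \theta^{(n)}(T)])] \\
&= E[\operatorname{tr}(E^{(n)\prime}WE^{(n)} + 2E^{(n)\prime}W\theta^{(n)}(T) + \theta^{(n)}(T)'W\theta^{(n)}(T))] \\
&= E\{E[\operatorname{tr}(E^{(n)\prime}WE^{(n)} + 2E^{(n)\prime}W\theta^{(n)}(T) + \theta^{(n)}(T)'W\theta^{(n)}(T)) \mid T]\} \\
&= E\{E[\operatorname{tr}(E^{(n)\prime}WE^{(n)}) \mid T]\} + E[\operatorname{tr}(\theta^{(n)}(T)'W\theta^{(n)}(T))] \\
&\le E\{E[\operatorname{tr}(E^{(n)\prime}WE^{(n)}) \mid T]\} + C_1np_nM_\theta M_n^{-2s_\theta}
\end{aligned}$$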



El E 



Pn 



■J 



■J 



■■E 



5:tr(WSS^(T)) 
■i=i 

1/2 



+ CmpnMeM-^'" 
+ CinpnMeM-^'o 



<npnaA +CinpnMgMj 



-2sg 



triAB)<X^,^{B)tr{A). 



Here, sg^ (T) = diag(sg)(Ti), . . . , sg)(T„)). □ 



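Proof of Theorem 1. Let $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$ and $g(T) = (g(T_1), \ldots, g(T_n))'$.
Since $\hat\beta^{(n)}$ minimizes $Q_n(b^{(n)})$, it necessarily holds that
$Q_n(\hat\beta^{(n)}) \le Q_n(\beta^{(n)})$. Rewriting this inequality, we have

$$\|WX^{(n)}(\hat\beta^{(n)} - \beta^{(n)})\|^2 - 2(\varepsilon + g(T))'WX^{(n)}(\hat\beta^{(n)} - \beta^{(n)}) \le \tfrac{1}{2}nk_n(a+1)\lambda_n^2.$$

Let $\delta_n = n^{-1/2}[X^{(n)\prime}WX^{(n)}]^{1/2}(\hat\beta^{(n)} - \beta^{(n)})$ and

$$\omega_n = n^{-1/2}[X^{(n)\prime}WX^{(n)}]^{-1/2}X^{(n)\prime}W(\varepsilon + g(T)).$$

Then, $\|\delta_n - \omega_n\|^2 \le \|\omega_n\|^2 + 0.5k_n(a+1)\lambda_n^2$. By the $C_r$ inequality,

$$\|\delta_n\|^2 \le 2(\|\delta_n - \omega_n\|^2 + \|\omega_n\|^2) \le 4\|\omega_n\|^2 + k_n(a+1)\lambda_n^2.$$

Examine

$$\|\omega_n\|^2 = n^{-1}(\varepsilon + g(T))'WX^{(n)}[X^{(n)\prime}WX^{(n)}]^{-1}X^{(n)\prime}W(\varepsilon + g(T)) = I_{n1} + I_{n2} + I_{n3},$$

where

$$\begin{aligned}
I_{n1} &= n^{-1}\varepsilon'WX^{(n)}[X^{(n)\prime}WX^{(n)}]^{-1}X^{(n)\prime}W\varepsilon, \\
I_{n2} &= 2n^{-1}\varepsilon'WX^{(n)}[X^{(n)\prime}WX^{(n)}]^{-1}X^{(n)\prime}Wg(T), \\
I_{n3} &= n^{-1}g(T)'WX^{(n)}[X^{(n)\prime}WX^{(n)}]^{-1}X^{(n)\prime}Wg(T).
\end{aligned}$$

Now, $I_{n1} = E[E(I_{n1} \mid X^{(n)}, T)]O_p(1) = p_nn^{-1}O_p(1)$. By the property of
projection matrices,

$$I_{n3} \le n^{-1}g(T)'Wg(T) = M_n^{-2s_g}O(1).$$

Thus, $\|\omega_n\|^2 = O_p(p_n/n + M_n^{-2s_g})$. Furthermore,

$$\|\hat\beta^{(n)} - \beta^{(n)}\|^2 = O_p(p_n/n + M_n^{-2s_g} + k_n\lambda_n^2)$$

follows from Lemma 1 with (A2). Thus, (A3) immediately leads to the
consistency. □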

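Lemma 3 (Rate of convergence). Suppose (A1)-(A4) hold. Then,

$$\|\hat\beta^{(n)} - \beta^{(n)}\| = O_p\bigl(\sqrt{p_n/n} + \sqrt{p_n}\,M_n^{-s_g}\bigr).$$

Proof. Let $u_n = \sqrt{p_n/n} + M_n^{-s_g} + \sqrt{k_n}\lambda_n$. When $u_n = o(\min_{1 \le j \le k_n}|\beta_j^{(n)}|)$,
with probability tending to 1, $\min_{1 \le j \le k_n}|\hat\beta_j^{(n)}| > a\lambda_n$. Given a sequence
$\{h_n : h_n > 0\}$ that converges to 0, partition $\mathbb{R}^{p_n} \setminus \{0_{p_n}\}$ into shells
$\{S_{n,l}, l = 0, 1, \ldots\}$, where $S_{n,l} = \{b^{(n)} : 2^{l-1}h_n < \|b^{(n)} - \beta^{(n)}\| < 2^lh_n\}$. Then,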

P(||3i"'-/3W||>2^/i„) 

<o(l)+ ^(3i"^G5n,«,||C(")||<c/2) 



<o(l) + ^p(' sup 2(e + g(T))'WX(")(b(") -/3(")) 

> inf (b^'^) - /3(") )'X(")' WX(") (b(") - /3('') ) , 

b(")GS„,i 

iicWii < 

II 11-2 

<Yp( sup (£ + g(T))'WX(")(b(") -/3(")) >22'-4ncA/i^) +o(l). 



since 



sup |(£ + g(T))'WX(")(b(") -/3("))| 



< 2^hnJE[{e + g(T))'WX{")X{")'W(£ + g(T))] 



< 2'+^/2/i„A/£;[e'WXWxWWe] +£;[g(T)'WXHX(")'Wg(T)] 



< C^npn + S[g(T)'Wg(T) tr(X(")XWW)] 

< 2' /i„C4 ( + n^M~'<' ) . 

Continuing the previous arguments, by the Markov inequality,

F(ii3r - /5«ii > 2'ft„) < 0(1) + 1: ^'^'^,!,f|^'""' ^ 

This shows that $\|\hat\beta^{(n)} - \beta^{(n)}\| = O_p(\sqrt{p_n/n} + \sqrt{p_n}\,M_n^{-s_g})$. □

Proof of Theorem 2. Consider the partial derivatives of $Q_n(\beta^{(n)} + v^{(n)})$.
We assume $\|v^{(n)}\| = O_p(\sqrt{p_n/n} + \sqrt{p_n}\,M_n^{-s_g})$. Suppose the support
sets of $e^{(n)}$ are all contained in a compact set $[-C_e, C_e]$. For $j = k_n + 1, \ldots, p_n$,




if ||v(")||<A„„ 

9Qn(/3^"| + vW) ^ 2x!f WX^vW +2XP'W(e + g(T)) +nA„sgn(t;;")) 



max =2|xf")'WX(")v(" 
+i<j<P n 



<2||v(")|| max ||X^"^'WX(") | 

n 



<{^^i + ^M-'^)Op{l) 



X max ||Wx(;)||AVf^(X(")'WXW) 



(VP^ + '^VP^^n '')Op(l) V + Op(l) 



= \/pn(n + n2M„ '^«)A„,ax(HW)Op(l). 

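So, this term is dominated by $|II_{n3,j}|$, as long as

$$\lim_{n \to \infty} \frac{\sqrt{n}\,\lambda_n}{\sqrt{p_n}\,\lambda_{\max}(\Sigma^{(n)})} = \infty \quad\text{and}\quad \lim_{n \to \infty} \frac{\lambda_nM_n^{s_g}}{\sqrt{p_n}\,\lambda_{\max}(\Sigma^{(n)})} = \infty,$$

both of which are stated in (A5). To sift out all the trivial components, we
need

$$P\Bigl(\max_{k_n+1 \le j \le p_n}|II_{n2,j}| > n\lambda_n/2\Bigr) \to 0.$$

This is also implied by (A5), as can be seen from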



P\ max |//„2j| > nA„/2 

+i<i<P 

< 2-g[maxfc„+i<j<p„ |//n2,il] 



^ 2A/2^E^:A:„+i{^[e'WX}"^Xf ^'We] +£;[g(T)'WXf ^Xf ^'Wg(T)]} 



nAr, 
The proof is now complete. □

Proof of Theorem 3. Let $A_n$ be any $t \times k_n$ matrix with full row rank
and $\Sigma_n = A_nA_n'$. From the variable selection conclusion, with probability
tending to 1, we have

$$\hat\beta_1^{(n)} - \beta_1^{(n)} = [X_1^{(n)\prime}WX_1^{(n)}]^{-1}X_1^{(n)\prime}W(g(T) + \varepsilon).$$

We consider the limit distribution of 

— Inl + In2, 

where 

Inl = n-V2s-V2A„sW-i/\W'w5(T) 

and 

/„2 = n"V2s;;V2A„H('j)"'/'x(")'We. 

Note that the conclusion of Theorem 3 is equivalent to $V_n \to_D N(0_t, \sigma^2I_t)$.
The first term $I_{n1}$ is an $o_p(1)$ term under (A6) and (A7), as shown in

Inl = n-V2s-V2A„Hi^ "'/'Ei">W5(T) 

+ n-V2s-V2A„HS^ "'/'0S")'(T)W5(T), 

= II nl + II n2, 

where 

\\IInlf=E\\IInlfOp{l) 

= n-i£;b(T)'WE!")HS^"'/'A;S-iA„HS';)"'^V^'W5(T)]0p(l) 

= n-ii?{^,(T)'Wi^[ES")HS^"'/'A'„S-iA„H(^)"'/V)'|T]W^,(T)} 
xOp(l) 

<?i-^^{5(T)'WS[eS"^hS"^~^eS"^'|T]W5(T)}Op(1) 
= n"iS{5(T)'WDiag(tr(H(^)"'E(^)(Ti)), . . . , 

tr(H;^~'sS)(r„)))W<7(T)}Op(l) 

<n-i||W^^(T)ftr(si';Jj 

= tr(sgjM-2««Op(l) = op(l) 
and

= nM-2(^3+"'')0(l). 
Decompose the second term $I_{n2}$ as

+ n-V2s;;V2A„H('r'^'0(")'(T)Ws, 

= IIInl + III n2 + 111 n3- 

Actually, the last two terms above are trivial: 

\\IIIn2f = n-iOp(l)E[tr(p(")E(")sS^ "'/'a:,S-1A„hS)"'/ V'P("))] 

<n-iOp(l)i?[tr(P^")ES")HS';)"V)'P^'^))] 

= n-iOp(l)E[tr(P^")E("^E(")')] 
<n-iOp(l)tr(s2i)i?[tr(/^"))] 

= tr(Ei';;i)M„/nOp(l)=op(l). 

||///„3||2 = n-iOp(l)E[tr(W6>i"\T)Hi"'~^6'S")'(T)W)] 
= A:„M-2-''Op(l) = op(l). 

So we focus on Illni = ?^~^^^5]„^^^A„h|^^^ ^ E^^^'e, since 
Var(///„i) = ^[Var(///„i|xW, T)] = a^/, 

and the infinitely small condition holds, provided E[£'^] < cxd, and by the 

Lindeberg-Feller central limit theorem we have $III_{n1} \to_D N(0_t, \sigma^2I_t)$. The
conclusion follows from Slutsky's theorem. □

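Lemma 4. Sequences of random variables $A_n$ and random vectors $B_n$
satisfy $E[A_n^2 \mid B_n] = O_p(u_n^2)$, where $\{u_n\}$ is a sequence of positive numbers.
Then, $A_n = O_p(u_n)$.

Proof. For any $\epsilon > 0$, there is some $M_1$ such that $P(E[A_n^2 \mid B_n] >
M_1u_n^2) < \epsilon/2$. Let $M_2^2 = 2M_1/\epsilon$. Then, using the conditional Markov
inequality in the fourth step,

$$\begin{aligned}
P(|A_n| > M_2u_n) &\le P(|A_n| > M_2u_n,\ E[A_n^2 \mid B_n] \le M_1u_n^2) + P(E[A_n^2 \mid B_n] > M_1u_n^2) \\
&\le E[1(|A_n| > M_2u_n)\,1(E[A_n^2 \mid B_n] \le M_1u_n^2)] + \epsilon/2 \\
&= E\{1(E[A_n^2 \mid B_n] \le M_1u_n^2)\,E[1(|A_n| > M_2u_n) \mid B_n]\} + \epsilon/2 \\
&\le E\Bigl\{1(E[A_n^2 \mid B_n] \le M_1u_n^2)\,\frac{E[A_n^2 \mid B_n]}{M_2^2u_n^2}\Bigr\} + \epsilon/2 \\
&\le \epsilon.
\end{aligned}$$

The arbitrariness of $\epsilon$ implies the conclusion. □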

Proof of Theorem 4. The nonparametric component $g(\cdot)$ at a point
$t \in [0,1]$ is estimated with

$$\hat g_n(t) = Z(t; \Delta_n)'(Z^{(n)\prime}Z^{(n)})^{-1}Z^{(n)\prime}(Y - X^{(n)}\hat\beta^{(n)}).$$

With probability tending to 1,



5„(t)-5(t) = Z(i;A„y(zWz 
= Z(i;A„)'(Z(")'Z 
+ Z(i;A„)'(z(" 

-Z(t;A„)'(z(" 

-Z(i;A„/(Z(" 



i2 _ rr?? /'+^ ^/'+m2 . 



Consider $\|\hat g_n - g\|_T^2 = \int [\hat g_n(t) - g(t)]^2f_T(t)\,dt$. Without further assumptions,
by Lemma 9 in Stone (1985), $\|I_{n1}\|_T^2 = O_p(M_n^{-2s_g})$. When $M_n = o(\sqrt{n})$, by
Lemma 4 in Stone (1985), $E[\|I_{n2}\|_T^2 \mid T] = O_p(M_n/n)$ and hence
$\|I_{n2}\|_T^2 = O_p(M_n/n)$.



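When $\{\theta_j^{(n)}(\cdot), n \ge 1, 1 \le j \le k_n\}$ are uniformly bounded on $[0,1]$,

$$\begin{aligned}
\|I_{n3}\|_T^2 &\le \|Z(t; \Delta_n)'(Z^{(n)\prime}Z^{(n)})^{-1}Z^{(n)\prime}\theta_1^{(n)}(T)\|_T^2\,\|\hat\beta_1^{(n)} - \beta_1^{(n)}\|^2 \\
&\le [O(k_n) + O_p(k_nM_n^{-2s_\theta})][O_p(1)M_n^{-2s_g} + k_n/n\,O_p(1)] \\
&= O_p(1)(k_nM_n^{-2s_g} + k_n^2n^{-1}).
\end{aligned}$$

Similarly,

$$\begin{aligned}
\|I_{n4}\|_T^2 &\le \|Z(t; \Delta_n)'(Z^{(n)\prime}Z^{(n)})^{-1}Z^{(n)\prime}E_1^{(n)}\|_T^2\,\|\hat\beta_1^{(n)} - \beta_1^{(n)}\|^2 \\
&= \|\hat\beta_1^{(n)} - \beta_1^{(n)}\|^2\,\|Z(t; \Delta_n)'(Z^{(n)\prime}Z^{(n)})^{-1}Z^{(n)\prime}E_1^{(n)}\|_T^2 \\
&\le O_p(k_nM_n/n)[O_p(1)M_n^{-2s_g} + k_n/n\,O_p(1)] \\
&= O_p(1)(M_n^{1-2s_g}k_n/n + M_nk_n^2/n^2).
\end{aligned}$$

To sum up, when $k_n = o(\sqrt{n})$, we have $\|\hat g_n - g\|_T^2 = O_p(k_n^2/n + M_n/n +
k_nM_n^{-2s_g})$. □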
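
The estimator $\hat g_n(t) = Z(t;\Delta_n)'(Z'Z)^{-1}Z'(Y - X\hat\beta)$ studied in this proof can be illustrated in its simplest special case. The sketch below assumes an order-1 spline basis (indicator functions of equal-width bins on $[0,1]$), for which $Z'Z$ is diagonal and the estimator reduces to the average of the partial residuals $Y_i - X_i'\hat\beta$ in the bin containing $t$. The function name `g_hat_piecewise_constant` and the toy data are hypothetical; the paper itself works with general B-splines of order $m$.

```python
def g_hat_piecewise_constant(t, T, Y, X, beta_hat, num_bins):
    """Evaluate Z(t)'(Z'Z)^{-1}Z'(Y - X beta_hat) for an indicator basis on [0, 1]."""

    def bin_of(s):
        # Equal-width bins; s = 1 is assigned to the last bin.
        return min(int(s * num_bins), num_bins - 1)

    # Partial residuals after removing the fitted linear part.
    residuals = [y - sum(x * b for x, b in zip(row, beta_hat))
                 for y, row in zip(Y, X)]
    target = bin_of(t)
    in_bin = [r for s, r in zip(T, residuals) if bin_of(s) == target]
    if not in_bin:
        raise ValueError("empty bin: Z'Z is singular for this partition")
    # For an indicator basis, least squares on the residuals is the bin mean.
    return sum(in_bin) / len(in_bin)


# Toy data (made up for illustration): one covariate, four observations.
T = [0.1, 0.2, 0.6, 0.9]
X = [[1.0], [2.0], [1.0], [0.0]]
Y = [3.0, 5.0, 4.0, 1.0]
beta_hat = [2.0]
```

Here the partial residuals are 1, 1, 2 and 1, so with two bins the estimate is the mean of the first two residuals on $[0, 0.5)$ and the mean of the last two on $[0.5, 1]$.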